Clear Current Environment

rm(list = ls())

Load Key packages

require(dplyr)
require(magrittr)
require(webshot)

Read in data

df <- read.csv("data.txt", header = TRUE)

Model Selection Process

Dataset includes 95,775 rows and 481 unique variables.

To begin my modeling process, I will select 13 variables. The first is “TARGET_B,” the target we are investigating; the other 12 are predictors I believe could have a strong impact on it. These 12 predictors comprise 7 categorical and 5 continuous variables. To reduce program run time, I will create a subset of the original data set containing only these columns. If the models can predict whether or not an individual will donate, the company saves money on mailings while keeping the expected returns on donations; in turn, the veterans non-profit organization should see greater gross profit from increased savings and similar donation returns.

Note: This subsetted data frame will come into use in the modeling portion of the report, following the visualization section.

Categorical Variables Include the Following: TARGET_B, VETERANS, HOMEOWNR, BIBLE, WEALTH2, GENDER, STATE, and INCOME. Continuous Variables Include the Following: AGE, POP901, RAMNTALL, NGIFTALL, and TIMELAG.

Data Definitions

Categorical:
TARGET_B - Whether or not the individual donated
VETERANS - Whether or not the individual was a veteran
HOMEOWNR - Home owner
BIBLE - Reads the Bible
WEALTH2 - Wealth rating
GENDER - Male or female
STATE - State the individual resides in
INCOME - Household income (based on levels)

Continuous:
AGE - Overlay age
POP901 - Number of persons in neighborhood
RAMNTALL - Dollar amount of lifetime gifts to date
NGIFTALL - Number of lifetime gifts to date
TIMELAG - Number of months between first and second gift

Create subset of data using dplyr select statement and magrittr piping

subDf <- df %>% select(TARGET_B, VETERANS, HOMEOWNR, BIBLE, WEALTH2, GENDER, STATE, INCOME, AGE, POP901, RAMNTALL, NGIFTALL, TIMELAG)

Show structure of new data frame

subDf$TARGET_B <- as.factor(subDf$TARGET_B)
subDf$WEALTH2 <- as.factor(subDf$WEALTH2)
str(subDf)
## 'data.frame':    95775 obs. of  13 variables:
##  $ TARGET_B: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VETERANS: Factor w/ 2 levels " ","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HOMEOWNR: Factor w/ 3 levels " ","H","U": 1 2 3 3 2 1 2 3 3 1 ...
##  $ BIBLE   : Factor w/ 2 levels " ","Y": 1 1 1 1 2 1 1 1 1 1 ...
##  $ WEALTH2 : Factor w/ 10 levels "0","1","2","3",..: 6 10 2 1 NA NA 4 NA 3 10 ...
##  $ GENDER  : Factor w/ 6 levels " ","C","F","J",..: 3 5 5 3 3 1 3 3 5 5 ...
##  $ STATE   : Factor w/ 57 levels "AA","AE","AK",..: 20 9 33 9 14 4 21 24 18 48 ...
##  $ INCOME  : int  NA 6 3 1 3 NA 4 2 3 NA ...
##  $ AGE     : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ POP901  : int  992 3611 7001 640 2520 18172 1067 1485 2268 2607 ...
##  $ RAMNTALL: num  240 47 202 109 254 51 107 31 199 28 ...
##  $ NGIFTALL: int  31 3 27 16 37 4 14 5 11 3 ...
##  $ TIMELAG : int  4 18 12 9 14 6 4 6 8 7 ...
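
To quantify how many rows the later na.omit() step will drop, missingness can be checked per column (a quick sketch):

# Count missing values per column; str() above shows NAs in WEALTH2, INCOME, and AGE
colSums(is.na(subDf))

# Proportion of complete rows that would survive na.omit()
mean(complete.cases(subDf))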

Show total donators vs non-donators

table(subDf$TARGET_B)
## 
##     0     1 
## 90186  5015

Percentage of donators

round(length(which(subDf$TARGET_B == 1))/length(subDf$TARGET_B) * 100, 2)
## [1] 5.24

Visualize Data

In this section I will visually investigate how the variables I perceived to be valuable affect both TARGET_B, the binary yes/no donation variable, and TARGET_D, the continuous variable recording how much an individual donated to the paralyzed veterans organization.

Load visualization packages

require(ggplot2)
require(plotly)
require(fiftystater)

Show barplot of donations

Plot in R

# Convert binary data from 0 and 1 to text
plotBar <- df
plotBar$TARGET_B <- ifelse(df$TARGET_B == 0,"Non-Donator","Donator")

# Plot
donateBar <- ggplot(data = plotBar, aes(x = TARGET_B, fill = TARGET_B))
donateBar <- donateBar + geom_bar()
donateBar <- donateBar + xlab("Donated")
donateBar <- donateBar + ylab("Number")
donateBar <- donateBar + ggtitle("Donator vs. Non-Donator")
donateBar

Show density of TARGET_D donation amounts

ggplot(df, aes(TARGET_D, fill="Mean Donated")) +
  geom_density(adjust=5) +
  xlim(0, 35) +
  ggtitle("Kernel Density Plot of Donation Totals") +
  xlab("Amount Donated") +
  ylab("Density")

It appears the majority of individuals who were mailed did not donate, and those who did typically donated little.
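
The “donated little” observation can be checked numerically with a summary of donors only (a sketch using the same df):

# Five-number summary of donation amounts, restricted to individuals who donated
summary(df$TARGET_D[df$TARGET_D > 0])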

Stacked bar graph showing the proportion of veterans within the non-donator and donator groups

plotBar$VETERANS <- ifelse(df$VETERANS == "Y","Yes","No")
g <- ggplot(plotBar, aes(TARGET_B))
g <- g + geom_bar()
g + geom_bar(aes(fill = VETERANS))

According to this visualization, the split between donating and non-donating veterans appears similar to that of non-veterans. Perhaps the VETERANS variable is not a good predictor of donations.
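
The visual impression can be confirmed with a proportion table (a sketch; plotBar carries the recoded labels from above):

# Donation rate within veterans vs. non-veterans; rows sum to 1
round(prop.table(table(plotBar$VETERANS, plotBar$TARGET_B), margin = 1), 3)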

Boxplot of Gender Donations

# Create Data Frame with Donation and gender
genderDonation <- df %>%
  select(GENDER, TARGET_D) 

# Only show subset of individuals who have donated greater than $0
genderDonation <-  subset(genderDonation, TARGET_D > 0)

# Use plotly to show range of donations by gender
plot_ly(genderDonation, x = genderDonation$GENDER, y = genderDonation$TARGET_D, color = genderDonation$GENDER, type = "box") %>%
  layout(title = "Donation Range by Gender",
           xaxis = list(title="Gender"),
           yaxis = list(title="Donation Amount"))

Important notes on factor levels: J = joint account, U = unknown, ‘blank’ = not listed, F = female, M = male, C = no donations.

This boxplot shows the range of donations by gender, considering only individuals who donated. Males have the highest values of the five-number summary except for the minimum, and tie with females for the maximum.

Show total number of individuals who donated by state

# Create data frame with state and yes or no donated variable
stateCountDonated <- df %>% select(STATE, TARGET_B)

# Keep only individuals who donated
stateCountDonated <- subset(stateCountDonated, TARGET_B > 0)

stateCountDonated <- stateCountDonated %>%
  group_by(STATE) %>% 
  summarise(count_donated = sum(TARGET_B))

plot_ly(stateCountDonated, 
        x = stateCountDonated$STATE, y = stateCountDonated$count_donated, type = "scatter", mode = "lines") %>% 
  layout(title = "Number of Individuals who Donated by State",
           xaxis = list(title="State"),
           yaxis = list(title="Donator Count"))

California has the highest number of donators at 1,295, with Florida second at 850. While California has many donators, keep in mind that it also has a very large population, which could affect these results.
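
One way to control for population is to normalize donor counts by the number of individuals mailed in each state (a hedged sketch on the raw df):

# Donation rate per state = donators / individuals mailed
stateRate <- df %>%
  group_by(STATE) %>%
  summarise(mailed = n(), rate = mean(TARGET_B, na.rm = TRUE)) %>%
  arrange(desc(rate))
head(stateRate)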

Show average donation by state (only counting those who donated)

# Create data frame with state and donation amount
stateAvgDonation <-  df %>% select(STATE, TARGET_D)

# Subset to only have data on individuals who donated
stateAvgDonation <-  subset(stateAvgDonation, TARGET_D > 0)

# Create grouping
stateAvgDonation <- stateAvgDonation %>%
  group_by(STATE) %>% 
  summarise(state_avg = mean(TARGET_D))

plot_ly(stateAvgDonation, 
        x = stateAvgDonation$STATE, y = stateAvgDonation$state_avg, type = "scatter", mode = "lines") %>% 
  layout(title = "Average Donation by State for Individuals who Donated",
           xaxis = list(title="State"),
           yaxis = list(title="Donation Average"))

It appears that “AE” has the highest average donation. AE stands for Armed Forces Europe (which also covers the Middle East, Africa, and Canada); it makes sense that military personnel would donate generously. Among actual states, Florida has the highest average at $25.45 per donation, with Idaho a close second at $24.84 per donation.

Florida may be a very good state to target, as it has the second-highest number of donators and the highest average dollars per donation in the continental United States.

Visual relationship between the continuous variables AGE and RAMNTALL (dollar amount of lifetime gifts to date)

plot_ly(df, x = ~AGE, y = ~RAMNTALL, color = ~TARGET_D) %>%
    layout(title = "Age, Total Donation, and Individual Donation Time Amount",
           xaxis = list(title="Age"),
           yaxis = list(title="Life Time Donations"))

Several insights can be gathered from this visualization, which encodes three variables. First, total giving appears to rise slightly between ages 20 and 40; after age 40 it seems to level off, excluding outliers. It also appears that those who give the largest single donations are not necessarily the most generous over time; in fact, the most extreme giver has given close to $6,000 over their lifetime but has never donated more than $50 at one time. Individuals listed as donors under 10 years old are presumably not the real donators; the donation was likely made in their name.
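
The age/giving relationship suggested above can be quantified with a simple correlation (a sketch; "complete.obs" drops rows with missing ages):

# Linear association between age and lifetime giving
cor(df$AGE, df$RAMNTALL, use = "complete.obs")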

Relationships of my variables in subsetted data frame

Load Corrplot package

require(corrplot)

Load packages for model building

require(caret)
require(e1071)
require(rpart)
require(rpart.plot)
require(gbm)
require(InformationValue)

Machine Learning Modeling

Convert binary variables to 1s and 0s

# Recode Y/blank to "1"/"0" (character vectors; glm() will treat these as factors)
subDf$VETERANS <- ifelse(subDf$VETERANS == "Y","1","0")
subDf$BIBLE <- ifelse(subDf$BIBLE == "Y","1","0")

Create 10-Fold validation configuration of control

fitControl <- trainControl(method = "cv", number = 10)

Training a logistic Regression Model

# Remove na's to make modeling possible
noNaSub <- na.omit(subDf)
set.seed(1234)
logFit <- glm(TARGET_B ~.,
              family=binomial(link='logit'),
              data = noNaSub)
summary(logFit)
## 
## Call:
## glm(formula = TARGET_B ~ ., family = binomial(link = "logit"), 
##     data = noNaSub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6459  -0.3305  -0.2198  -0.1643   3.6240  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.252e-03  1.248e+03   0.000  0.99999    
## VETERANS1   -2.192e-01  8.012e-02  -2.736  0.00622 ** 
## HOMEOWNRH    1.283e-02  2.257e-01   0.057  0.95467    
## HOMEOWNRU   -3.425e-01  2.330e-01  -1.470  0.14155    
## BIBLE1       1.111e+00  6.257e-02  17.749  < 2e-16 ***
## WEALTH21    -1.272e-01  1.623e-01  -0.784  0.43333    
## WEALTH22    -7.007e-01  1.611e-01  -4.349 1.37e-05 ***
## WEALTH23     1.122e+00  1.323e-01   8.484  < 2e-16 ***
## WEALTH24    -2.259e-01  1.579e-01  -1.431  0.15237    
## WEALTH25    -2.579e-01  1.512e-01  -1.706  0.08809 .  
## WEALTH26    -3.228e-03  1.472e-01  -0.022  0.98251    
## WEALTH27     1.662e-01  1.461e-01   1.138  0.25530    
## WEALTH28     1.021e-01  1.474e-01   0.693  0.48854    
## WEALTH29    -4.227e-01  1.477e-01  -2.861  0.00422 ** 
## GENDERF      1.229e+01  8.827e+02   0.014  0.98889    
## GENDERJ      1.244e+01  8.827e+02   0.014  0.98875    
## GENDERM      1.266e+01  8.827e+02   0.014  0.98855    
## GENDERU      1.196e+01  8.827e+02   0.014  0.98919    
## STATEAE     -2.826e+01  1.075e+03  -0.026  0.97902    
## STATEAK     -1.774e+01  8.827e+02  -0.020  0.98396    
## STATEAL     -1.747e+01  8.827e+02  -0.020  0.98421    
## STATEAR     -1.751e+01  8.827e+02  -0.020  0.98417    
## STATEAZ     -1.726e+01  8.827e+02  -0.020  0.98440    
## STATECA     -1.651e+01  8.827e+02  -0.019  0.98508    
## STATECO     -1.716e+01  8.827e+02  -0.019  0.98449    
## STATECT     -2.907e+01  1.011e+03  -0.029  0.97705    
## STATEDE     -2.805e+01  1.248e+03  -0.022  0.98208    
## STATEFL     -1.755e+01  8.827e+02  -0.020  0.98414    
## STATEGA     -1.744e+01  8.827e+02  -0.020  0.98424    
## STATEHI     -1.647e+01  8.827e+02  -0.019  0.98511    
## STATEIA     -1.921e+01  8.827e+02  -0.022  0.98264    
## STATEID     -1.899e+01  8.827e+02  -0.022  0.98283    
## STATEIL     -1.796e+01  8.827e+02  -0.020  0.98377    
## STATEIN     -1.833e+01  8.827e+02  -0.021  0.98343    
## STATEKS     -1.719e+01  8.827e+02  -0.019  0.98446    
## STATEKY     -1.725e+01  8.827e+02  -0.020  0.98441    
## STATELA     -1.744e+01  8.827e+02  -0.020  0.98424    
## STATEMA     -2.825e+01  1.018e+03  -0.028  0.97786    
## STATEMD     -1.628e+01  8.827e+02  -0.018  0.98529    
## STATEME     -2.895e+01  1.018e+03  -0.028  0.97732    
## STATEMI     -1.715e+01  8.827e+02  -0.019  0.98450    
## STATEMN     -1.846e+01  8.827e+02  -0.021  0.98331    
## STATEMO     -1.727e+01  8.827e+02  -0.020  0.98439    
## STATEMS     -1.748e+01  8.827e+02  -0.020  0.98420    
## STATEMT     -1.720e+01  8.827e+02  -0.019  0.98446    
## STATENC     -1.749e+01  8.827e+02  -0.020  0.98419    
## STATEND     -1.695e+01  8.827e+02  -0.019  0.98468    
## STATENE     -1.644e+01  8.827e+02  -0.019  0.98514    
## STATENH     -2.855e+01  9.642e+02  -0.030  0.97638    
## STATENJ     -2.954e+01  1.248e+03  -0.024  0.98112    
## STATENM     -1.670e+01  8.827e+02  -0.019  0.98490    
## STATENV     -1.649e+01  8.827e+02  -0.019  0.98510    
## STATENY     -2.837e+01  9.242e+02  -0.031  0.97551    
## STATEOH     -2.922e+01  9.288e+02  -0.031  0.97490    
## STATEOK     -1.759e+01  8.827e+02  -0.020  0.98411    
## STATEOR     -1.723e+01  8.827e+02  -0.020  0.98443    
## STATEPA     -1.647e+01  8.827e+02  -0.019  0.98512    
## STATERI     -2.833e+01  1.248e+03  -0.023  0.98190    
## STATESC     -1.700e+01  8.827e+02  -0.019  0.98464    
## STATESD     -1.647e+01  8.827e+02  -0.019  0.98511    
## STATETN     -1.755e+01  8.827e+02  -0.020  0.98414    
## STATETX     -1.727e+01  8.827e+02  -0.020  0.98439    
## STATEUT     -1.824e+01  8.827e+02  -0.021  0.98352    
## STATEVA     -2.864e+01  9.812e+02  -0.029  0.97672    
## STATEVT     -2.975e+01  1.248e+03  -0.024  0.98099    
## STATEWA     -1.697e+01  8.827e+02  -0.019  0.98466    
## STATEWI     -1.735e+01  8.827e+02  -0.020  0.98432    
## STATEWV     -1.494e+01  8.827e+02  -0.017  0.98649    
## STATEWY     -1.802e+01  8.827e+02  -0.020  0.98371    
## INCOME       3.575e-02  1.793e-02   1.993  0.04623 *  
## AGE          1.823e-02  1.834e-03   9.942  < 2e-16 ***
## POP901      -1.326e-05  6.167e-06  -2.150  0.03154 *  
## RAMNTALL    -2.531e-03  3.894e-04  -6.500 8.04e-11 ***
## NGIFTALL     3.889e-02  3.907e-03   9.953  < 2e-16 ***
## TIMELAG      2.529e-02  4.125e-03   6.130 8.77e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14604  on 34625  degrees of freedom
## Residual deviance: 11632  on 34551  degrees of freedom
## AIC: 11782
## 
## Number of Fisher Scoring iterations: 13

Many of the variables have very low p-values, which suggests they could be strongly related to TARGET_B.
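
Rather than scanning the full coefficient table, the significant predictors can be filtered directly (a sketch, assuming logFit as fit above):

# Keep only coefficients significant at the 0.05 level
coefs <- summary(logFit)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]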

Accuracy of model

# Predict on data
pred <- predict(logFit, newdata = noNaSub, type = "response")

prediction <- ifelse(pred > 0.5, 1, 0)
real <- noNaSub$TARGET_B

# Show accuracy
mean(prediction == real)
## [1] 0.9619939

96% accuracy sounds good in theory; the confusion matrix below should indicate how well the model predicts each class of the binary variable independently.

Logistic Regression Model Confusion Matrix

caret::confusionMatrix(factor(prediction), real, positive="1", mode="everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 32734  1303
##          1    13   576
##                                          
##                Accuracy : 0.962          
##                  95% CI : (0.9599, 0.964)
##     No Information Rate : 0.9457         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4526         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.30655        
##             Specificity : 0.99960        
##          Pos Pred Value : 0.97793        
##          Neg Pred Value : 0.96172        
##               Precision : 0.97793        
##                  Recall : 0.30655        
##                      F1 : 0.46677        
##              Prevalence : 0.05427        
##          Detection Rate : 0.01663        
##    Detection Prevalence : 0.01701        
##       Balanced Accuracy : 0.65307        
##                                          
##        'Positive' Class : 1              
## 

With high accuracy, a narrow confidence interval, a low p-value, and accurate prediction of non-donators, the model appears to be in good shape. However, it struggles with false negatives: it often predicts that an individual will not donate when they actually do.
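
Because false negatives are the main weakness, one lever worth exploring is lowering the 0.5 classification threshold, trading precision for recall (a sketch; the 0.2 cutoff is an arbitrary illustration):

# Hypothetical lower cutoff: flags more predicted donators,
# reducing false negatives at the cost of more false positives
prediction20 <- ifelse(pred > 0.2, 1, 0)
table(Predicted = prediction20, Actual = real)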

Plot ROC Curve to visualize specificity and sensitivity

InformationValue::plotROC(real, prediction)

InformationValue::AUROC(real, prediction)
## [1] 0.6530137

The area under the ROC curve is moderate.
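
Note that the curve above was computed from the thresholded 0/1 predictions; scoring the raw probabilities instead generally yields a more informative curve and a higher AUC (a sketch):

# ROC/AUC on predicted probabilities rather than hard labels
actuals <- as.numeric(as.character(real))
InformationValue::plotROC(actuals, pred)
InformationValue::AUROC(actuals, pred)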

KS Chart

ks_plot(real, prediction)

Calculate the McFadden R2 for the logistic regression model

# Fit null (intercept-only) model on the same rows used to fit logFit
nullModel <- glm(TARGET_B ~ 1,family=binomial(link='logit'), data=noNaSub)

# Calculate McFadden R2
cat("McFadden pseudo R2 = ", 1-logLik(logFit)/logLik(nullModel))
## McFadden pseudo R2 =  0.2035

A McFadden pseudo-R2 of roughly 0.20 is reasonable for a logistic model of this kind. (Fitting the null model on subDf, which still contains NAs, would overstate the value, because the two log-likelihoods would be computed on different sets of rows.)

Remove na’s to make decision tree type modeling possible

noNaSub <- na.omit(subDf)

Training a decision tree model

# Set seed to obtain same results each time
set.seed(1234)

# Train Decision Tree Model
decisionTree <- train(TARGET_B ~.,
            data = noNaSub,
            method = "rpart",
            trControl = fitControl,
            tuneLength = 10,
            parms=list(split='information'))

print(decisionTree)
## CART 
## 
## 34626 samples
##    12 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 31164, 31164, 31164, 31163, 31163, 31163, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.9563623  0.4185851
##   0.01463544  0.9620806  0.4525365
##   0.02927089  0.9620806  0.4525365
##   0.04390633  0.9611853  0.4467983
##   0.05854178  0.9611853  0.4467983
##   0.07317722  0.9611853  0.4467983
##   0.08781267  0.9611853  0.4467983
##   0.10244811  0.9493732  0.1221204
##   0.11708356  0.9493732  0.1221204
##   0.13171900  0.9493732  0.1221204
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02927089.

Visualize the decision tree

rpart.plot::rpart.plot(decisionTree$finalModel)

Training Gradient Boosted Machine Model

# Set seed to obtain same result each run
set.seed(1234)

# Train gradient boosted model
gbmFit <- train(TARGET_B ~.,
                 data = noNaSub,
                 trControl = fitControl,
                 method = "gbm",
                 verbose = F)

print(gbmFit)
## Stochastic Gradient Boosting 
## 
## 34626 samples
##    12 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 31164, 31164, 31164, 31163, 31163, 31163, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9623116  0.4538439
##   1                  100      0.9621961  0.4529664
##   1                  150      0.9621961  0.4529664
##   2                   50      0.9621672  0.4527578
##   2                  100      0.9621383  0.4525526
##   2                  150      0.9621383  0.4525526
##   3                   50      0.9622827  0.4536248
##   3                  100      0.9621961  0.4529784
##   3                  150      0.9621961  0.4529784
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 50, interaction.depth
##  = 1, shrinkage = 0.1 and n.minobsinnode = 10.
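
caret can also report relative variable importance for both resampled models, helping confirm which predictors carry the signal (a sketch):

# Relative importance of predictors in each caret model
caret::varImp(decisionTree)
caret::varImp(gbmFit)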

Comparison of Models

All three model types predict at extremely high accuracy rates. Unfortunately, because donors are so rare in the data, the models struggled to accurately predict when individuals would donate, resulting in type II errors in the logistic regression model and low kappa scores for the decision tree and gradient boosted machine. I still believe these models have value: the goal was to eliminate non-donors from the mailing pool to decrease costs, and the models identify non-donors very accurately while still finding many donors. Although they miss some donors in their current state, using these models should yield large savings on mailings while retaining most of the expected donations.
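
Because the decision tree and GBM were trained with the same fitControl and the same seed, their cross-validation folds match and caret's resamples() can compare them directly (a sketch; the logistic model was fit with glm() and cannot be included):

# Compare cross-validated Accuracy and Kappa across the two caret models
results <- caret::resamples(list(CART = decisionTree, GBM = gbmFit))
summary(results)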